As always, I need to load gapminder and the tidyverse, along with forcats, scales, and plotly:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(plotly))

Part 1 of the assignment - Factor management

Elaboration for the gapminder data set: the first task is to filter the Gapminder data to remove observations associated with the continent of Oceania. To have a baseline for comparison before changing anything, I will first look at the structure of gapminder and at the factor properties of gapminder$continent.

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
str(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"
forcats::fct_count(gapminder$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

Another way to get the same counts is with dplyr:

gapminder %>% 
  count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
no_oceania <- gapminder %>%
  filter(continent!="Oceania")

str(no_oceania)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
#After filtering out Oceania there are only 1680 rows, compared to 1704 in the full data set. However, str() tells me that continent is still a factor with 5 levels, as before.
levels(no_oceania$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
#Oceania is still a level after using filter() alone

Because Oceania is still a level after using filter() alone, I will now try fct_drop() from forcats.

no_oceania$continent %>% 
  fct_drop() %>% 
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"
no_oceania
## # A tibble: 1,680 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,670 more rows
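#fct_drop() returns the factor without the unused Oceania level. Note that the result is not assigned back, so no_oceania$continent itself still carries all 5 levels. The 1,680 rows (versus 1704 before) come from the earlier filter() step, which is a sanity check that the Oceania rows are gone.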
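Since the pipeline above only prints the levels, here is a minimal sketch of how the dropped level could be kept in the data frame. The object names `no_oceania_dropped` and `no_oceania_base` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(forcats)
library(gapminder)

# Filter out Oceania and overwrite continent with a version that has no unused levels
no_oceania_dropped <- gapminder %>%
  filter(continent != "Oceania") %>%
  mutate(continent = fct_drop(continent))

nlevels(no_oceania_dropped$continent)
levels(no_oceania_dropped$continent)

# Base R alternative: droplevels() drops unused levels in every factor column,
# so the two Oceania countries also disappear from the country levels
no_oceania_base <- droplevels(filter(gapminder, continent != "Oceania"))
nlevels(no_oceania_base$continent)
nlevels(no_oceania_base$country)
```

The difference between the two approaches is scope: `fct_drop()` touches only the factor it is given, while `droplevels()` on a data frame cleans up every factor column at once.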

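Now I will re-order the continents by population, from smallest to largest. Note that the call below uses max as the summary function, so each continent is ranked by the largest population recorded for any single country and year, not by the continent's summed population.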

fct_reorder(gapminder$continent, gapminder$pop, max) %>% 
  levels() %>% 
  head()
## [1] "Oceania"  "Europe"   "Africa"   "Americas" "Asia"

Backwards re-order, from largest population to smallest:

fct_reorder(gapminder$continent, gapminder$pop, max, .desc = TRUE) %>% 
  levels() %>% 
  head() 
## [1] "Asia"     "Americas" "Africa"   "Europe"   "Oceania"
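The two calls above rank continents by their largest single country-year population, because `max` is the summary function. A sketch of ranking by total continent population instead is shown below. It assumes 2007 as the reference year, which is an assumption for illustration and not something fixed by the assignment. The `as.numeric()` conversion is there because `pop` is stored as an integer and summing Asia's population would overflow R's integer range.

```r
library(dplyr)
library(forcats)
library(gapminder)

# Restrict to one year so that each country is counted once (2007 is an assumed choice)
gap_2007 <- gapminder %>%
  filter(year == 2007)

# Reorder continents by summed population, smallest to largest.
# as.numeric() avoids integer overflow when the populations are added up.
by_total_pop <- fct_reorder(gap_2007$continent, as.numeric(gap_2007$pop), sum)
levels(by_total_pop)

# Same ranking, largest to smallest
by_total_pop_desc <- fct_reorder(gap_2007$continent, as.numeric(gap_2007$pop), sum,
                                 .desc = TRUE)
levels(by_total_pop_desc)
```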

To verify that Asia is the most populous continent and Oceania is the least populous, I plotted the total population of each continent by year (in millions). Africa and the Americas are neck and neck for second most populous.

gapminder %>% 
  mutate(pop = pop/1000000) %>% 
  group_by(continent, year) %>% 
  summarize(pop = sum(pop)) %>% 
  ggplot(aes(year, pop)) +
  geom_line(aes(color=continent))

Part 2 - File I/O

I first filtered the gapminder data so that I only have the Americas data for year 2007 and named that Amer_gap.

Amer_gap <- gapminder %>%
  filter(year == 2007, continent == "Americas")

str(Amer_gap) 
## Classes 'tbl_df', 'tbl' and 'data.frame':    25 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 5 12 15 21 24 26 30 33 37 38 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ year     : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num  75.3 65.6 72.4 80.7 78.6 ...
##  $ pop      : int  40301927 9119152 190010647 33390141 16284741 44227550 4133884 11416987 9319622 13755680 ...
##  $ gdpPercap: num  12779 3822 9066 36319 13172 ...
head(Amer_gap)
## # A tibble: 6 x 6
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Argentina Americas   2007    75.3  40301927    12779.
## 2 Bolivia   Americas   2007    65.6   9119152     3822.
## 3 Brazil    Americas   2007    72.4 190010647     9066.
## 4 Canada    Americas   2007    80.7  33390141    36319.
## 5 Chile     Americas   2007    78.6  16284741    13172.
## 6 Colombia  Americas   2007    72.9  44227550     7007.
write_csv(Amer_gap, "Amer_gap.csv")

Then I re-opened the CSV file and saw that country and continent have turned into character vectors, whereas they were factors before.

df <- read_csv("Amer_gap.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )
df
## # A tibble: 25 x 6
##    country            continent  year lifeExp       pop gdpPercap
##    <chr>              <chr>     <int>   <dbl>     <int>     <dbl>
##  1 Argentina          Americas   2007    75.3  40301927    12779.
##  2 Bolivia            Americas   2007    65.6   9119152     3822.
##  3 Brazil             Americas   2007    72.4 190010647     9066.
##  4 Canada             Americas   2007    80.7  33390141    36319.
##  5 Chile              Americas   2007    78.6  16284741    13172.
##  6 Colombia           Americas   2007    72.9  44227550     7007.
##  7 Costa Rica         Americas   2007    78.8   4133884     9645.
##  8 Cuba               Americas   2007    78.3  11416987     8948.
##  9 Dominican Republic Americas   2007    72.2   9319622     6025.
## 10 Ecuador            Americas   2007    75.0  13755680     6873.
## # ... with 15 more rows
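The factor columns come back as character because a CSV file stores only text. Below is a minimal sketch of two possible ways to get factors back, assuming temporary files and the illustrative names `amer_2007`, `reread`, and `restored` (none of which appear in the original analysis).

```r
library(dplyr)
library(readr)
library(gapminder)

amer_2007 <- gapminder %>%
  filter(year == 2007, continent == "Americas")

# Option 1: stay with CSV, but tell read_csv() which columns are factors.
# levels = NULL means the levels are taken from the values found in the file.
csv_path <- tempfile(fileext = ".csv")
write_csv(amer_2007, csv_path)

reread <- read_csv(
  csv_path,
  col_types = cols(
    country   = col_factor(levels = NULL),
    continent = col_factor(levels = NULL),
    year      = col_integer(),
    lifeExp   = col_double(),
    pop       = col_integer(),
    gdpPercap = col_double()
  )
)
class(reread$country)
nlevels(reread$country)   # only the countries present in the file

# Option 2: saveRDS()/readRDS() store the R object itself,
# so the factors and their full set of levels survive the round trip
rds_path <- tempfile(fileext = ".rds")
saveRDS(amer_2007, rds_path)
restored <- readRDS(rds_path)
class(restored$country)
nlevels(restored$country) # all original levels, including unused ones
```

The trade-off is portability: the CSV can be opened by any tool but loses R-specific metadata, while the RDS file keeps everything but is readable only from R.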

Now creating a new factor (subcont) with 3 levels:

df$subcont <- fct_collapse(.f = df$country, 
              "North America" = c("Canada", "United States", "Mexico", "Puerto Rico", "Trinidad and Tobago"),
              "Central America" = c("Cuba", "Dominican Republic", "Haiti", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Nicaragua", "Panama", "Jamaica"),
              "South America" = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela"))
             

df
## # A tibble: 25 x 7
##    country         continent  year lifeExp      pop gdpPercap subcont     
##    <chr>           <chr>     <int>   <dbl>    <int>     <dbl> <fct>       
##  1 Argentina       Americas   2007    75.3   4.03e7    12779. South Ameri…
##  2 Bolivia         Americas   2007    65.6   9.12e6     3822. South Ameri…
##  3 Brazil          Americas   2007    72.4   1.90e8     9066. South Ameri…
##  4 Canada          Americas   2007    80.7   3.34e7    36319. North Ameri…
##  5 Chile           Americas   2007    78.6   1.63e7    13172. South Ameri…
##  6 Colombia        Americas   2007    72.9   4.42e7     7007. South Ameri…
##  7 Costa Rica      Americas   2007    78.8   4.13e6     9645. Central Ame…
##  8 Cuba            Americas   2007    78.3   1.14e7     8948. Central Ame…
##  9 Dominican Repu… Americas   2007    72.2   9.32e6     6025. Central Ame…
## 10 Ecuador         Americas   2007    75.0   1.38e7     6873. South Ameri…
## # ... with 15 more rows
df$subcont
##  [1] South America   South America   South America   North America  
##  [5] South America   South America   Central America Central America
##  [9] Central America South America   Central America Central America
## [13] Central America Central America Central America North America  
## [17] Central America Central America South America   South America  
## [21] North America   North America   North America   South America  
## [25] South America  
## Levels: South America North America Central America
#Another way to do the same thing:

df %>%
  mutate(subcont = fct_collapse(.f = country, "North America" = c("Canada", "United States", "Mexico", "Puerto Rico", "Trinidad and Tobago"),
              "Central America" = c("Cuba", "Dominican Republic", "Haiti", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Nicaragua", "Panama", "Jamaica"),
              "South America" = c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela")))
## # A tibble: 25 x 7
##    country         continent  year lifeExp      pop gdpPercap subcont     
##    <chr>           <chr>     <int>   <dbl>    <int>     <dbl> <fct>       
##  1 Argentina       Americas   2007    75.3   4.03e7    12779. South Ameri…
##  2 Bolivia         Americas   2007    65.6   9.12e6     3822. South Ameri…
##  3 Brazil          Americas   2007    72.4   1.90e8     9066. South Ameri…
##  4 Canada          Americas   2007    80.7   3.34e7    36319. North Ameri…
##  5 Chile           Americas   2007    78.6   1.63e7    13172. South Ameri…
##  6 Colombia        Americas   2007    72.9   4.42e7     7007. South Ameri…
##  7 Costa Rica      Americas   2007    78.8   4.13e6     9645. Central Ame…
##  8 Cuba            Americas   2007    78.3   1.14e7     8948. Central Ame…
##  9 Dominican Repu… Americas   2007    72.2   9.32e6     6025. Central Ame…
## 10 Ecuador         Americas   2007    75.0   1.38e7     6873. South Ameri…
## # ... with 15 more rows
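With this many country names typed by hand, it can be worth checking that every value ended up in one of the new groups. The sketch below uses a small made-up vector (`countries`) and a hypothetical helper vector (`expected_groups`) purely for illustration. A country that was misspelled or forgotten in `fct_collapse()` would keep its own name as a level and show up in `leftover`.

```r
library(forcats)

# Toy input: a handful of countries as a character vector
countries <- c("Canada", "Mexico", "Cuba", "Brazil", "Peru")

region <- fct_collapse(
  countries,
  "North America"   = c("Canada", "Mexico"),
  "Central America" = "Cuba",
  "South America"   = c("Brazil", "Peru")
)

# Count how many values fall into each collapsed level
fct_count(region)

# Any level that is not one of the intended groups was missed in the collapse
expected_groups <- c("North America", "Central America", "South America")
leftover <- setdiff(levels(region), expected_groups)
leftover
```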

Part 3 - Visualization design

Before

Starting off, plotting lifeExp against gdpPercap gives me this figure. Overall, I can see that life expectancy goes up with GDP per capita, but I don't know much else about those data points, e.g., which continent they belong to or how large the population is.

ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10() + 
  geom_point() 

After

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10(labels = dollar_format()) +
  scale_y_continuous(breaks=1:10 * 10, labels = comma_format()) +
  geom_point(aes(color = continent, alpha = .2)) +
  geom_smooth() +
  labs(x = "GDP per capita",
       y = "Life Expectancy",
       title = "Life Expectancy and GDP per Capita by Continent") +
  theme_classic() +
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12))

p
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

#I think this second graph is more interesting and informative. It has clear labels. It includes a smoothed trend line (fitted with a GAM, according to the message above) with a standard-error band. We can see where the countries from different continents land in terms of life expectancy and GDP. However, I don't know how to get rid of the alpha entry in the legend.
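The alpha entry most likely shows up in the legend because `alpha = .2` sits inside `aes()`, where ggplot2 treats it as a mapped variable and builds a legend for it. A minimal sketch of one possible fix follows: set alpha as a fixed parameter outside `aes()`. The name `p_fixed` is illustrative, and the `override.aes` line is an optional extra so that the continent colours in the legend are not drawn transparent.

```r
library(ggplot2)
library(scales)
library(gapminder)

p_fixed <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10(labels = dollar_format()) +
  # alpha outside aes(): a constant setting, so no alpha legend is created
  geom_point(aes(color = continent), alpha = 0.2) +
  geom_smooth() +
  # draw the legend keys fully opaque even though the points are transparent
  guides(color = guide_legend(override.aes = list(alpha = 1))) +
  theme_classic()

p_fixed
```

If alpha had to stay inside `aes()` for some reason, adding `guides(alpha = "none")` to the plot would be another way to hide that legend entry.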

Now I will convert the above ggplot to plotly by first loading plotly

suppressPackageStartupMessages(library(plotly))
# p %>% 
 # ggplotly()

#Using plotly has the benefit of producing an interactive graph that shows information about each data point you hover over. You can also compare multiple data points using "Compare data on hover", and you can zoom in and out to inspect the data points further.
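Because the `ggplotly()` call above is commented out, here is a small sketch of what the conversion could look like when it is run. It rebuilds a simplified version of the plot so that the block is self-contained. The names `p_simple`, `p_interactive`, and `html_path` are illustrative, and writing the widget to a temporary HTML file is an assumption about how one might keep the interactive version when the report format cannot display it.

```r
library(ggplot2)
library(plotly)
library(gapminder)

# Simplified stand-in for the plot built earlier
p_simple <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10() +
  geom_point(aes(color = continent), alpha = 0.2)

# Convert the static ggplot into an interactive plotly widget
p_interactive <- ggplotly(p_simple)
class(p_interactive)

# Save the widget as a standalone HTML page that can be opened in a browser
html_path <- tempfile(fileext = ".html")
htmlwidgets::saveWidget(p_interactive, html_path, selfcontained = FALSE)
```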

Part 4 - Using ggsave

ggsave("hw05_plot.png", p, scale = 1, width = NA, height = NA, dpi = 600, limitsize = TRUE)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Now I will load and embed it into the report.
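A sketch of the save-then-embed step is given below, under a few assumptions: it uses a simplified stand-in plot (`p_simple`), writes to a temporary file (`out_file`), and sets the width and height explicitly instead of relying on the size of the current graphics device. The embedding line is left as a comment because it only does something useful inside a knitted R Markdown document.

```r
library(ggplot2)
library(gapminder)

# Simplified stand-in for the plot saved above
p_simple <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  scale_x_log10() +
  geom_point()

# Passing the plot explicitly avoids depending on whichever plot was drawn last
out_file <- file.path(tempdir(), "hw05_plot_example.png")
ggsave(out_file, plot = p_simple, width = 7, height = 5, units = "in", dpi = 300)
file.exists(out_file)

# Inside an R Markdown chunk, the saved file could then be embedded with:
# knitr::include_graphics("hw05_plot.png")
```

Fixing `width` and `height` makes the saved image reproducible, since the defaults otherwise depend on the size of the plotting window at the time of the call.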